-
Notifications
You must be signed in to change notification settings - Fork 1.6k
KEP-2371: update to beta #5632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2371: update to beta #5632
Conversation
haircommander
commented
Oct 7, 2025
- One-line PR description:
- Issue link: cAdvisor-less, CRI-full Container and Pod Stats #2371
- Other comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of diff but is this still an open question?
* **What specific metrics should inform a rollback?** | ||
###### What specific metrics should inform a rollback? | ||
|
||
The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented above but if kubelet provider is not working, should we expose a metric or something?
If Kubelet is unable to post metrics on a node, it seems difficult to find this out currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the admin attempted to roll out the feature and it failed, the metric saying provider is 'cadvisor' unexpectedly would be the signal that the fallback happened
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense and the metric is exposed per node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is still a very difficult thing for someone to detect.
The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled.
So the only way someone who find this out is if a kubelet on a node stopped posting metrics and that pod/container on that node was not found in prometheus.
That seems very complicated to tell if I had 5000 nodes.
Its worth calling out that the rollback failing would be cadvisor but if the metrics are not being posted then what is the best way to find that out? How does one find the bad node via metrics or monitoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still not addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the answer here is above in the GA section where there is:
- Likely, the `cri_losing_support` metric will be used to report that users on versions lower than 2.2 will lose support by a targeted GA version.
Would be good to ensure this is updated here as well.
Please address the verify job failure. PRR shadow: I left some comments but overall I think it is close. |
f04c699
to
0c5a6c4
Compare
thanks @kannon92 updated! |
+1 to @kannon92 comments. We need a clear transition docs that indicate how to transition. Also there should be graceful period of supporting both equally well, maybe even simultaneously. This was the goal of transition documentation goal for beta - make sure it is reviewed and we understand if we need to announce deprecation or can just suggest an easy migration for each metric |
0c5a6c4
to
1b9b218
Compare
I have updated based on comments. I have made a couple of things explicity:
|
Note: we chatted about this in SIG Node and agreed that the stats provider (cadvisor or partial CRI) is an implementation detail, and doesn't currently have any configuration. Introducing configuration to allow an admin to toggle whether we turn on full cri stats just to remove it in a handful of releases doesn't seem worth it. We decided to announce deprecation in beta, and move forward with it in GA when we decide to drop support for containerd < 2.2. We also chatted with @marosset and agreed that windows support wouldn't block this KEP going to beta but we'd best-effort try to include it because this has been a sore spot for windows for a long time. |
#### Alpha -> Beta Graduation | ||
#### Beta | ||
|
||
- Conformance tests for the fields in `/metrics/cadvisor` should be created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not have critest for metrics we want to be collected? We need some way confirming that users transition will be seamless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added!
- cAdvisor stats provider will likely be marked as deprecated (depending on dockershim deprecation). | ||
- cAdvisor stats provider support will be dropped, as well as support for partial cri stats provider. | ||
- Feature gate removed and the CRI stats provider will no longer rely on cAdvisor for container/pod level metrics. | ||
- Conformance tests for stats and metrics being present as expected from the new sources, and performance/scale testing should show comparable performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does conformance include specific set of metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not currently AFAIU
1b9b218
to
5845ad0
Compare
Signed-off-by: Peter Hunt <[email protected]>
5845ad0
to
b901bf4
Compare
### /metrics/cadvisor | ||
|
||
1. Expose the metric fields provided in `/metrics/cadvisor` in an analogous Prometheus endpoint directly from the CRI implementation. | ||
1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from from the CRI implementation and reported through the Kubelet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from from the CRI implementation and reported through the Kubelet. | |
1. Expose the metric fields provided in `/metrics/cadvisor` in the same Prometheus endpoint, gathered by Kubelet from the CRI implementation and reported through the Kubelet. |
automations, so be extremely careful here. | ||
Enabling this behavior means some stats endpoints will not be filled: | ||
###### Does enabling the feature change any default behavior? | ||
: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: |
/hold Based on #5632 (comment), waiting on @SergeyKanzhelev lgtm. PRR shadow: Just one item I still think needs more detail on it. #5632 (comment) I think the answer is that in order to GA this feature we cannot have a node fail to publish metrics via cri. As once this feature is GA there is no alternative so a node would miss kubelet metrics and there is no possible way to recover that once we lock the gate to true and drop the kubelet stats provider code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
/unhold |
Based on #5632 (comment), /assign @soltysh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to conditionally approve the PRR, but here are the things that need to be ensured for this work:
- This functionality SHOULD NOT be enabled by default, if containerd doesn't release v2.2 before 1.35 dev cycle ends. Given that v2.2.0-beta.1 was release 5 days ago you should be ok.
- Update the missing bits I've mentioned in the doc (mostly the rollback metric, ideally also the template, but that's a minor problem).
/approve
the PRR section
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) | ||
<!-- /toc --> | ||
|
||
# cAdvisor-less, CRI-full Container and Pod Stats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: but it seems this KEP didn't get updated template
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you see a section I missed? I thought I updated the template and I'm not seeing anything missing
* **What specific metrics should inform a rollback?** | ||
###### What specific metrics should inform a rollback? | ||
|
||
The lack of any metrics reported for pods and containers is the worst case scenerio here, and would require either a rollback or for the feature gate to be disabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the answer here is above in the GA section where there is:
- Likely, the `cri_losing_support` metric will be used to report that users on versions lower than 2.2 will lose support by a targeted GA version.
Would be good to ensure this is updated here as well.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, haircommander, SergeyKanzhelev, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I'm not even sure we should do on by default beta until 1.36 anyway, because 2.2 will barely be anywhere when 1.35 releases |
That's very reasonable approach, I'm definitely supportive. |